Fix deferrable Beam Dataflow operators failing with 400 when job ID is missing from stdout #69102

Open
gingeekrishna wants to merge 6 commits into apache:main from gingeekrishna:feature/68279-fix-deferrable-beam-dataflow-job-id

@gingeekrishna

@gingeekrishna gingeekrishna commented Jun 28, 2026

Closes #68279

Problem

When the Dataflow launcher subprocess runs with the default WARNING log level, it does not emit the "Created job with id: [...]" line that the Beam operator parses to capture the Dataflow job ID. This leaves dataflow_job_id = None, and the deferrable trigger then fails with "400 Request must contain a job and project id".
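
For illustration, the stdout parsing that depends on this line can be sketched as below. The regular expression, the function name, and the sample lines are simplified assumptions made for this example, not the provider's actual pattern.

```python
from __future__ import annotations

import re

# Simplified, assumed pattern: the real provider regex may differ.
JOB_ID_PATTERN = re.compile(r"Created job with id: \[(?P<job_id>[^\]]+)\]")


def extract_job_id(line: str) -> str | None:
    """Return the job ID found in a launcher log line, or None if absent."""
    match = JOB_ID_PATTERN.search(line)
    return match.group("job_id") if match else None


# Made-up sample lines: at INFO the ID line is present, at WARNING it is not.
info_line = "INFO: Created job with id: [2026-06-28_00_00_00-123]"
warning_line = "WARNING: some unrelated launcher warning"
```

With only WARNING-level output, no line ever matches, so the parsed ID stays None.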

The previous PRs (#67711, #68720) addressed this by adding a fallback after the launcher subprocess finished — but as reviewer @MaksYermak correctly noted, that is not the root cause fix: by the time the launcher exits, the Dataflow job may have already completed, so deferral never gets a chance to free the Airflow worker.

Root Cause Fix

The correct fix is to capture the job ID during the stdout-reading loop, before the launcher finishes, so the operator can truly defer.

Changes

providers/google/.../hooks/dataflow.py

  • Add DataflowHook.fetch_job_id_by_name(prefix_name, location, project_id) — looks up a Dataflow job by name prefix via the API, returning its ID.
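
A minimal sketch of the prefix lookup, assuming the job listing has already been fetched as a list of dicts with "id" and "name" keys. The helper name and the sample data are hypothetical; the real method goes through the Dataflow API rather than an in-memory list.

```python
from __future__ import annotations

from collections.abc import Iterable


def find_job_id_by_name_prefix(jobs: Iterable[dict], prefix_name: str) -> str | None:
    """Return the ID of the first job whose name starts with ``prefix_name``.

    ``jobs`` stands in for an already-fetched job listing (an assumption of
    this sketch); no API call is made here.
    """
    for job in jobs:
        if job.get("name", "").startswith(prefix_name):
            return job.get("id")
    return None


# Made-up listing for illustration only.
sample_jobs = [
    {"id": "job-a", "name": "other-pipeline-1"},
    {"id": "job-b", "name": "my-pipeline-8f3c"},
]
```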

providers/apache/beam/.../hooks/beam.py

  • Add import time
  • Add periodic_callback: Callable[[], None] | None = None parameter to run_beam_command(), _start_pipeline(), start_python_pipeline(), and start_java_pipeline()
  • In run_beam_command(): invoke periodic_callback() roughly every 5 seconds while the subprocess is running (using time.monotonic() tracking). After each periodic call, check is_dataflow_job_id_exist_callback() and exit early if the ID has been resolved — before the subprocess finishes.
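
The loop described above can be sketched roughly as follows. This is a standalone illustration, not the hook's code: the function name, the raw os.read() draining, and the cleanup on early exit are assumptions of the sketch, and select() on a pipe is assumed to be available (Unix-like systems).

```python
from __future__ import annotations

import os
import select
import subprocess
import sys
import time
from collections.abc import Callable


def run_with_periodic_callback(
    cmd: list[str],
    periodic_callback: Callable[[], None] | None = None,
    is_job_id_known: Callable[[], bool] | None = None,
    interval: float = 5.0,
) -> bool:
    """Drain the child's output; return True if the loop exited early."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    fd = proc.stdout.fileno()
    last_call = time.monotonic()
    exited_early = False
    try:
        while True:
            # Wait for output, but never longer than ``interval`` seconds.
            ready, _, _ = select.select([fd], [], [], interval)
            if ready and not os.read(fd, 4096):
                break  # EOF: the child closed its output
            now = time.monotonic()
            if periodic_callback is not None and now - last_call >= interval:
                periodic_callback()
                last_call = now
                # Stop reading as soon as the job ID has been resolved.
                if is_job_id_known is not None and is_job_id_known():
                    exited_early = True
                    break
    finally:
        # Sketch-only cleanup so the example leaves no stray process behind.
        if proc.poll() is None:
            proc.kill()
        proc.stdout.close()
        proc.wait()
    return exited_early


if __name__ == "__main__":
    found = {"job_id": None}
    early = run_with_periodic_callback(
        [sys.executable, "-c", "import time; time.sleep(3)"],
        periodic_callback=lambda: found.update(job_id="resolved"),
        is_job_id_known=lambda: found["job_id"] is not None,
        interval=0.5,
    )
    print("exited early:", early)
```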

providers/apache/beam/.../operators/beam.py

  • Add BeamDataflowMixin.__get_dataflow_job_id_poll_callback(): returns a closure that calls DataflowHook.fetch_job_id_by_name() and sets self.dataflow_job_id when a matching job is found; silently retries on transient errors.
  • Update BeamRunPythonPipelineOperator.execute_on_dataflow() and BeamRunJavaPipelineOperator.execute_on_dataflow() to create and pass this callback.
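
A sketch of the closure shape, using hypothetical stand-ins (FakeDataflowHook, OperatorSketch) instead of the real hook and mixin. Note that this sketch treats every exception as transient, which is an assumption and may be broader than what the PR does.

```python
from __future__ import annotations

from collections.abc import Callable


class FakeDataflowHook:
    """Hypothetical stand-in that replays canned results, one per call."""

    def __init__(self, responses: list) -> None:
        self._responses = list(responses)

    def fetch_job_id_by_name(self, prefix_name: str, location: str, project_id: str):
        response = self._responses.pop(0)
        if isinstance(response, Exception):
            raise response
        return response


class OperatorSketch:
    """Hypothetical operator holding only the state the callback needs."""

    def __init__(self, hook, job_name: str, location: str, project_id: str) -> None:
        self.hook = hook
        self.job_name = job_name
        self.location = location
        self.project_id = project_id
        self.dataflow_job_id: str | None = None

    def get_job_id_poll_callback(self) -> Callable[[], None]:
        def poll() -> None:
            if self.dataflow_job_id is not None:
                return  # already resolved, nothing to do
            try:
                job_id = self.hook.fetch_job_id_by_name(
                    prefix_name=self.job_name,
                    location=self.location,
                    project_id=self.project_id,
                )
            except Exception:
                return  # assumed transient; the next tick retries
            if job_id:
                self.dataflow_job_id = job_id

        return poll
```

Because the closure only mutates self.dataflow_job_id, the Beam hook never needs to know that the Dataflow API is involved.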

How this fixes the issue

  1. Beam launcher starts (with DataflowRunner) — the Dataflow job is submitted.
  2. Every ~5 s, the periodic callback polls the Dataflow Jobs API by job name prefix.
  3. Once the job appears, dataflow_job_id is set, is_dataflow_job_id_exist_callback() returns True, and the stdout-reading loop exits immediately — before the Dataflow job finishes.
  4. The operator defers successfully, freeing the Airflow worker. The Dataflow job continues running on Google Cloud.

This path is the same whether or not the launcher emits a job-ID line to stdout. If stdout does emit the line, process_line_callback sets dataflow_job_id and the loop exits on the next is_dataflow_job_id_exist_callback() check, as before.

Tests

  • Updated hook-level tests to include periodic_callback=None in run_beam_command mock assertions (all callers that don't pass a periodic_callback).
  • Updated operator-level test_exec_dataflow_runner tests to include periodic_callback=mock.ANY.
  • Added test_exec_dataflow_runner_periodic_callback_fetches_job_id for both BeamRunPythonPipelineOperator and BeamRunJavaPipelineOperator: captures the periodic_callback passed by the operator, calls it directly, and asserts that dataflow_job_id is set by polling fetch_job_id_by_name.
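
The capture-and-call pattern used by the new tests looks roughly like this. It is a self-contained sketch with hypothetical names (execute_on_dataflow_sketch, the state dict, the argument values) built on unittest.mock, not the actual test code from the PR.

```python
from unittest import mock


def execute_on_dataflow_sketch(beam_hook, dataflow_hook, state: dict) -> None:
    """Hypothetical stand-in for the operator passing a poll callback."""

    def poll() -> None:
        job_id = dataflow_hook.fetch_job_id_by_name(
            prefix_name="my-job", location="us-central1", project_id="my-project"
        )
        if job_id:
            state["dataflow_job_id"] = job_id

    beam_hook.start_python_pipeline(variables={}, periodic_callback=poll)


def test_periodic_callback_fetches_job_id() -> None:
    beam_hook = mock.MagicMock()
    dataflow_hook = mock.MagicMock()
    dataflow_hook.fetch_job_id_by_name.return_value = "job-123"
    state = {"dataflow_job_id": None}

    execute_on_dataflow_sketch(beam_hook, dataflow_hook, state)

    # The callback is an inner function, so match it with mock.ANY.
    beam_hook.start_python_pipeline.assert_called_once_with(
        variables={}, periodic_callback=mock.ANY
    )
    # Capture the callback the code under test passed in and invoke it directly.
    captured = beam_hook.start_python_pipeline.call_args.kwargs["periodic_callback"]
    assert state["dataflow_job_id"] is None
    captured()
    assert state["dataflow_job_id"] == "job-123"
```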

Checklist

  • Root cause fixed (not a post-hoc fallback)
  • Backward compatible: periodic_callback defaults to None; existing callers are unaffected
  • Go operator not touched (it is sync-only, no execute_on_dataflow)
  • Syntax validated with py_compile
  • Newsfragment added: providers/apache/beam/newsfragments/68279.bugfix.rst

… missing from stdout

When the Dataflow launcher process runs with WARNING log level (the default),
it does not emit the "Created job with id" line that the Beam operator parses
to capture the Dataflow job ID. This left dataflow_job_id as None, causing the
deferrable trigger to fail with "400 Request must contain a job and project id".

Fix by adding a periodic_callback parameter to run_beam_command() that is
invoked roughly every 5 seconds while the launcher subprocess is running. The
deferrable Beam operators now pass a callback that polls DataflowHook.fetch_job_id_by_name()
to resolve the job ID by name. Once the ID is set, the stdout-reading loop exits
early so the operator can truly defer, freeing the Airflow worker while the
Dataflow job continues running on Google Cloud.

Fixes apache#68279
@gingeekrishna gingeekrishna requested a review from shahar1 as a code owner June 28, 2026 06:07
Copilot AI review requested due to automatic review settings June 28, 2026 06:07
@boring-cyborg boring-cyborg Bot added area:providers provider:apache-beam provider:google Google (including GCP) related issues labels Jun 28, 2026

Copilot AI left a comment

Contributor

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Comment thread providers/apache/beam/newsfragments/68279.bugfix.rst Outdated

@SameerMesiah97 SameerMesiah97 left a comment

Contributor

So I just scanned the diff, and 2 things just came to mind:

  1. Since this polling happens before the operator can defer, does this not increase the amount of time a worker slot is occupied? One of the main benefits of the deferrable path is freeing the worker as early as possible.
  2. This feels like it's introducing a waiter via a generic callback. If we're going this far, why not introduce a dedicated waiter/helper instead?

I think these higher-level concerns need to be addressed before a line-by-line review.

@gingeekrishna

gingeekrishna commented Jun 29, 2026

Author

Thanks for the review, @SameerMesiah97 — good questions.

1. Worker slot occupancy

The polling does not add any extra occupancy time. The worker was already occupied for the full duration of the launcher subprocess run before my change — run_beam_command has always blocked on select.select(reads, [], [], 5) in a loop until the subprocess exits. My periodic_callback runs inside that same existing loop, within the 5-second select timeout that was already there.

If anything, this fix can only reduce occupancy: when the job ID is resolved via the Dataflow API before the launcher finishes printing it to stdout, run_beam_command returns early (see line 204) and the operator can defer sooner.

2. Generic callback vs. dedicated waiter

The periodic_callback is intentionally generic because run_beam_command lives in the Beam hook, which should not know about the Dataflow API. A "dedicated waiter" would either need to couple Dataflow API knowledge into that generic hook, or be a wrapper class around run_beam_command that ends up doing the same thing less directly.

The callback pattern is already the established idiom here — process_line_callback and is_dataflow_job_id_exist_callback both pre-existed my change and follow the same shape. periodic_callback is just a timed variant of the same mechanism.

Happy to rename it or restructure if you have a specific shape in mind for the dedicated helper.

@MaksYermak MaksYermak left a comment

Contributor

@gingeekrishna thank you for the PR. Have you run the Dataflow system tests for these changes?

As for the solution, we do not need an additional callback. We can use the existing is_dataflow_job_id_exist_callback and extend it with logic that looks up the Dataflow job ID using dataflow_job_name. Additionally, the callback changes will only work together with changes to several other methods in the Beam hook; these are needed because otherwise the fd blocks the execution process.

I have almost finished the fix for this issue and will prepare a PR later this week.
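
One possible reading of this suggestion, sketched with hypothetical names (make_job_id_exists_callback, lookup_job_id_by_name, the state dict). It only shows an existence check that falls back to a lookup by job name; it is an assumption about the shape of the idea, not the reviewer's implementation, and it does not address the fd-blocking changes mentioned above.

```python
from __future__ import annotations

from collections.abc import Callable


def make_job_id_exists_callback(
    state: dict,
    lookup_job_id_by_name: Callable[[str], str | None],
    job_name: str,
) -> Callable[[], bool]:
    """Build an existence check that also tries to resolve the ID by name."""

    def is_job_id_exist() -> bool:
        # Only query by name while the ID is still unknown.
        if state.get("dataflow_job_id") is None:
            state["dataflow_job_id"] = lookup_job_id_by_name(job_name)
        return state["dataflow_job_id"] is not None

    return is_job_id_exist
```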

@gingeekrishna

Author

Thanks for the detailed review, @MaksYermak!

We haven't run the full Dataflow system tests — we only exercised unit tests and the existing beam hook test suite locally.

Your point about using the existing is_dataflow_job_id_exist_callback to also drive the API poll is a cleaner design than introducing a separate periodic_callback, and the note about the fd-blocking concern is well taken.

Since you're close to finishing your own fix with the correct approach, we're happy to close this PR and defer to yours. We'd rather avoid parallel work that heads in incompatible directions. Just let us know if there's anything from this PR you'd find useful to carry forward, or if you'd prefer we simply close it now.

@MaksYermak

Contributor

@gingeekrishna here is a PR, VladaZakharova#325, with the changes. It is blocked by #66952: the apache-beam provider is currently suspended, and we first need to un-suspend it to be able to run the Beam and Dataflow unit tests, run the Dataflow system tests, and develop using Breeze.

@gingeekrishna

gingeekrishna commented Jul 2, 2026

Author

Thanks for the context @MaksYermak — that's really helpful to understand the full picture. Makes sense that everything is gated on the provider un-suspension in #66952 first.

Given that your approach at VladaZakharova#325 is the correct design and is already waiting behind #66952, shall I go ahead and close this PR so there's no noise? Once #66952 lands and the beam provider is active again, your fix can move forward cleanly without any parallel confusion from ours.

Let me know if there's anything from this PR you'd like to pull in, or if you'd like a review on your branch once the block is cleared.
